# Mini Project(marathon result 2017)

This mini project will review marathon result dataset in different countries in 2017,with some main variables such as gender,age,overal ranking,city,country which I worked on them mainly also some other variable to be used created for this report such as finishing time. Here I listed some main steps to start working on this dataset:

Step 1: Using the function read_csv() to read and variables are assigned for further use.

Step 2: Checking for missed data.

Give an overview on data by head command:

head(marathon_2017)
##   Bib             Name Age M.F          City State Country     X5K    X10K
## 1  11  Kirui, Geoffrey  24   M      Keringet           KEN 0:15:25 0:30:28
## 2  17      Rupp, Galen  30   M      Portland    OR     USA 0:15:24 0:30:27
## 3  23    Osako, Suguru  25   M  Machida-City           JPN 0:15:25 0:30:29
## 4  21 Biwott, Shadrack  32   M Mammoth Lakes    CA     USA 0:15:25 0:30:29
## 5   9   Chebet, Wilson  31   M      Marakwet           KEN 0:15:25 0:30:28
## 6  15 Abdirahman, Abdi  40   M       Phoenix    AZ     USA 0:15:25 0:30:28
##      X15K    X20K    Half    X25K    X30K    X35K    X40K    Pace Proj.Time
## 1 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:02:53 0:04:57         -
## 2 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19 2:03:14 0:04:58         -
## 3 0:45:44 1:01:16 1:04:36 1:17:00 1:33:01 1:48:31 2:03:38 0:04:59         -
## 4 0:45:44 1:01:19 1:04:45 1:17:00 1:33:01 1:48:58 2:04:35 0:05:03         -
## 5 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:41 2:05:00 0:05:04         -
## 6 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:52 2:05:21 0:05:04         -
##   Official.Time Overall Gender Division
## 1       2:09:37       1      1        1
## 2       2:09:58       2      2        2
## 3       2:10:28       3      3        3
## 4       2:12:08       4      4        4
## 5       2:12:35       5      5        5
## 6       2:12:45       6      6        1
##           Bib          Name           Age           M.F          City 
##             0             0             0             0             0 
##         State       Country           X5K          X10K          X15K 
##             0             0             0             0             0 
##          X20K          Half          X25K          X30K          X35K 
##             0             0             0             0             0 
##          X40K          Pace     Proj.Time Official.Time       Overall 
##             0             0             0             0             0 
##        Gender      Division 
##             0             0

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

!colSums(is.na(marathon_2017))
##           Bib          Name           Age           M.F          City 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##         State       Country           X5K          X10K          X15K 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          X20K          Half          X25K          X30K          X35K 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##          X40K          Pace     Proj.Time Official.Time       Overall 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##        Gender      Division 
##          TRUE          TRUE

To have a demoghraphy from runners I applied summary comand:

by(marathon_2017$Age, marathon_2017$M.F, summary)
## marathon_2017$M.F: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   40.00   39.95   48.00   84.00 
## ------------------------------------------------------------ 
## marathon_2017$M.F: M
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   36.00   45.00   44.77   53.00   83.00

The youngest female finishers were 18 years old, and oldest female ones were 84, also youngest male finishers were 18 years old and oldest male runners were 83 years old.

library(ggplot2)
library(ggthemes)
ggplot(aes(x = Age, fill =M.F), data = marathon_2017) +
geom_histogram(position = position_dodge()) +
scale_x_continuous(breaks = round(seq(min(marathon_2017$Age), max(marathon_2017$Age),by = 5),10)) +
scale_colour_wsj() + theme_wsj()+
labs(title='Age and Gender of Runners')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It seemes women are more eager to marathon when they are younger,specially when they are 25 and again when they are around 40,but as they becoe older,they lose eagerness to run,but in reverse men become interested to marathon slow by slow when they become older,specifically when they become 40,45,50 they love to join marathon races.after turning to 50 always more number of men are interested to run compare to women.

library(ggthemes)
ggplot(aes(x=Gender, y=Age, fill=M.F), data = marathon_2017) +
geom_boxplot() +
theme_economist()+ scale_color_economist() +theme(plot.title = element_text(size = 20, face = "bold", color = "tomato", hjust = 0.5))+
labs(title=' Boxplot of Runners by Age and Gender')

Intersting point in this plot is Female runnersare younger than male runners in first quartile, median and third quartile.Median for women is at their 40 but for men is at their 45.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
marathon_2017 %>%
mutate(Official_Time=lubridate::hms(Official.Time)) %>%
mutate(New_time = lubridate::period_to_seconds(Official_Time)) %>% 
group_by(M.F) %>%
summarize(Ave_Time=mean(New_time, na.rm=T),Ave_Age=mean(Age,na.rm=T),.group="drop")
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   M.F   Ave_Time Ave_Age .group
##   <fct>    <dbl>   <dbl> <chr> 
## 1 F       14945.    40.0 drop  
## 2 M       13735.    44.8 drop

It could be seen Average time of finishing the race for women is more than men,also the women who end the race are younger compare to men.

New variable is defined as New_time to have finishing time of all the runners in seconds for easier analysis:

New_marathon <- marathon_2017 %>% 
mutate(official_time = lubridate::hms(Official.Time)) %>% 
mutate(New_time = lubridate::period_to_seconds(official_time)) %>% 
select(Age, M.F, official_time, New_time, everything())
New_marathon %>% 
group_by(M.F, Age) %>% 
summarize(ave_time = mean(New_time, na.rm = T), 
ave_age = mean(Age, na.rm = T),
.groups = "drop")
## # A tibble: 125 x 4
##    M.F     Age ave_time ave_age
##    <fct> <int>    <dbl>   <dbl>
##  1 F        18   17288.      18
##  2 F        19   15924.      19
##  3 F        20   15859.      20
##  4 F        21   14660.      21
##  5 F        22   15283.      22
##  6 F        23   14856.      23
##  7 F        24   14909.      24
##  8 F        25   14600.      25
##  9 F        26   14351.      26
## 10 F        27   14818.      27
## # … with 115 more rows
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
New_marathon %>% 
group_by(M.F, Age) %>% 
summarize(ave_time = mean(New_time, na.rm = T), 
ave_age = mean(Age, na.rm = T),.groups = "drop") %>% 
ggplot() + 
geom_point(aes(x = Age, y = ave_time, color = M.F))+labs(title="Trend of Average Finishing Time VS. Age and Gender")+theme(legend.position = "bottom")+theme(plot.title = element_text(size = 16, face = "bold",color = "tomato"))

ggplotly()

This plot shows that if we like or dislike , ageis really important! As you see when people are between 20 to 40 years old average finishing time is much less compare to when they become older. Second interesting point is that usually in all periods of life men finish a marathon race sooner than women.Specially when they are so old (after 70 years old) men run more faster than women.

New_marathon %>% 
group_by(M.F, Age) %>% 
summarize(ave_time = mean(New_time, na.rm = T), 
ave_age = mean(Age, na.rm = T),
.groups = "drop") %>% 
ggplot() + 
geom_density(aes(x = ave_time,color = M.F))+labs(title=" Average Finishing Time of Runners")+theme(plot.title = element_text(size = 20, face = "bold",color = "tomato"))

ggplotly()

Again we see there is a gender gap between male and female. on average male run faster than female. This gap is greater among the runners who finish sooner than among those who finish last. The peak finish time for male crossed finished line was in 13237 sec and 14825 sec for female.

New_marathon <- marathon_2017 %>% 
mutate(official_time = lubridate::hms(Official.Time)) %>% 
mutate(New_time = lubridate::period_to_seconds(official_time)) %>% 
select(Age, M.F, official_time, New_time, everything())
Runners<-New_marathon %>%
filter(Country=="JPN"|Country=="CAN"|Country=="ITA")
view(Runners)

With above commands we filter only 3 countries of Japan,Canada,Italy to analyse them in more detailes.

ggplot(data=Runners,mapping=aes(x=New_time,fill=M.F))+
geom_histogram(color = "darkorchid4")+labs(title="Gender distrubution in Different finishing Times for Specific 3 countries",subtitle="Japan,Italy,Canada")+scale_color_brewer(palette = "Pastel2")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplotly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For only 3 selected country from 3 different continent,I checked gender distrubition in different finishing times,most of the men run in 12800 seconds while most of the women run in 13400 seconds.

#Filter fast runners who finish game in less than *9000 seconds*
Runners<-New_marathon %>%
filter(New_time<9000)
view(Runners)
library(ggrepel)
ggplot(data=Runners,mapping=aes(x=New_time,y=Age,color=M.F)) + 
geom_point()+geom_line()+
xlab('Finish Time(sec)') +scale_color_fivethirtyeight()+
labs(title='Fast Runners Finishing Time')+geom_label_repel(aes(label = Overall))+theme(plot.title = element_text(size = 20, face = "bold",color = "tomato"))

This plot shows overall rank of a runner with its gender in finishing time of less than 9000 seconds. First one is a male who finishes in 7750 seconds and the first fast female runner finishes at 8600* seconds and her rank is 21th.

filter(New_marathon, M.F == "M",Country=="USA") %>%
mutate(Half_time =lubridate::hms(X10K)) %>% 
mutate(New_Halftime =lubridate::period_to_seconds(Half_time)) %>%  
mutate(official_time = lubridate::hms(Official.Time)) %>% 
mutate(New_time = lubridate::period_to_seconds(official_time)) %>%   
ggplot(aes(x=New_time, y=New_Halftime, rm.na=T)) +
geom_point(aes(colour = Country)) + 
geom_smooth(method='lm')+labs(title="Correlation of Halfway and Finishing Time",subtitle="USA")+coord_cartesian(xlim = c(7000, 25000),ylim = c (2000, 6000) )+
annotate(geom = "text",x = 15000, y = 4000,label = "Highly Correlated!")+theme(plot.title = element_text(size = 20, face = "bold",color = "tomato"))+xlab('Finish Time(sec)')+ylab("halftime(sec)")
## Warning: Problem with `mutate()` input `Half_time`.
## ℹ Some strings failed to parse, or all strings are NAs
## ℹ Input `Half_time` is `lubridate::hms(X10K)`.
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).

There is a correlation between finishing time and half time finishing in USA.

New_marathon1<-New_marathon %>% 
mutate(Half_time =lubridate::hms(X10K)) %>% 
mutate(New_Halftime =lubridate::period_to_seconds(Half_time)) %>%  
mutate(official_time = lubridate::hms(Official.Time)) %>% 
mutate(New_time = lubridate::period_to_seconds(official_time)) 
## Warning: Problem with `mutate()` input `Half_time`.
## ℹ Some strings failed to parse, or all strings are NAs
## ℹ Input `Half_time` is `lubridate::hms(X10K)`.
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
head(New_marathon1)
##   Age M.F official_time New_time Bib             Name          City State
## 1  24   M     2H 9M 37S     7777  11  Kirui, Geoffrey      Keringet      
## 2  30   M     2H 9M 58S     7798  17      Rupp, Galen      Portland    OR
## 3  25   M    2H 10M 28S     7828  23    Osako, Suguru  Machida-City      
## 4  32   M     2H 12M 8S     7928  21 Biwott, Shadrack Mammoth Lakes    CA
## 5  31   M    2H 12M 35S     7955   9   Chebet, Wilson      Marakwet      
## 6  40   M    2H 12M 45S     7965  15 Abdirahman, Abdi       Phoenix    AZ
##   Country     X5K    X10K    X15K    X20K    Half    X25K    X30K    X35K
## 1     KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19
## 2     USA 0:15:24 0:30:27 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:19
## 3     JPN 0:15:25 0:30:29 0:45:44 1:01:16 1:04:36 1:17:00 1:33:01 1:48:31
## 4     USA 0:15:25 0:30:29 0:45:44 1:01:19 1:04:45 1:17:00 1:33:01 1:48:58
## 5     KEN 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:41
## 6     USA 0:15:25 0:30:28 0:45:44 1:01:15 1:04:35 1:16:59 1:33:01 1:48:52
##      X40K    Pace Proj.Time Official.Time Overall Gender Division Half_time
## 1 2:02:53 0:04:57         -       2:09:37       1      1        1   30M 28S
## 2 2:03:14 0:04:58         -       2:09:58       2      2        2   30M 27S
## 3 2:03:38 0:04:59         -       2:10:28       3      3        3   30M 29S
## 4 2:04:35 0:05:03         -       2:12:08       4      4        4   30M 29S
## 5 2:05:00 0:05:04         -       2:12:35       5      5        5   30M 28S
## 6 2:05:21 0:05:04         -       2:12:45       6      6        1   30M 28S
##   New_Halftime
## 1         1828
## 2         1827
## 3         1829
## 4         1829
## 5         1828
## 6         1828

There is a high correlation of 0.903 between finishing time and halfway time in USA for maraton runners.As hifhway time increases so does finishing time.

library(ggrepel)
filter(New_marathon,Overall<=20)%>% 
mutate(official_time = lubridate::hms(Official.Time)) %>% 
mutate(New_time = lubridate::period_to_seconds(official_time)) %>%     
ggplot(fill="red")+theme(plot.title = element_text(size = 20, face = "bold",color = "tomato"))+
geom_col(aes(x = Overall,y= New_time,color="red"))+labs(title="Finishing Time of 20 First Runners")+xlab("Overall Rank")+ylab("Finishing Time")

ggplotly()

Filtering the first 20 runners,then plot overall ranks of them versus finishing time.First runners finish it in 7777 seconds,second one in 20 seconds later. The last one in 8418 seconds,so they really finish the race in a very close range.

New_marathon %>%
filter(Age<=25) %>%
ggplot(aes(x=Country,colour = Country,rm.na=T)) +
geom_bar(fill="pink") + 
labs(title="Countries with youngest runners",xaxis ="Number of runners", yaxis = "Country")+theme(axis.text.x =element_text(angle=65,vjust=0.6))+coord_flip()+theme_void()

ggplotly()

Above plot shows countries which have youngest runnersunder 25 years old,as it is obvious USA with 1666 young runners has the most youngest runners among other countries. with a huge difference canada is the second country with 43 young runners under 25 years old.

#spatial visualization:

library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
world <- ne_countries(scale = "medium", returnclass = "sf")
Runner_map <- read.csv("https://raw.githubusercontent.com/reisanar/datasets/master/marathon_results_2017.csv")
  # Rename country code column to ISO_A3 so it matches what's in the Natural Earth shapefile
Runner_map <- rename(Runner_map, iso_a3 = Country)  
new_map <- world %>%
left_join(Runner_map, by = "iso_a3") 
participants <- Runner_map %>% 
group_by(iso_a3) %>% 
count()
participants %>% 
arrange(desc(n))
## # A tibble: 91 x 2
## # Groups:   iso_a3 [91]
##    iso_a3     n
##    <fct>  <int>
##  1 USA    20945
##  2 CAN     1870
##  3 GBR      425
##  4 MEX      285
##  5 CHN      242
##  6 GER      226
##  7 BRA      205
##  8 AUS      191
##  9 JPN      170
## 10 ITA      165
## # … with 81 more rows
marathon <- world %>% 
left_join(participants, by = "iso_a3") 
ggplot(data = marathon) +
  geom_sf(aes(fill = n), color = "grey") +
  scale_fill_gradient(low = "#bfafd4",
                      high = "#857a94",
                      na.value = "#f8f7fa") +
  xlab("Longitude") + ylab("Latitude") +
  labs(
    title = "Runner Population in Every Country",
    barwidth = 15,
    barheight = 0.5,
    fill = NULL
  ) + theme_void()+theme(plot.title = element_text(size = 18, face = "bold",color = "tomato",hjust=0.5)) 

ggplotly()
Age_Rank <- Runner_map %>%
  filter(Age<= 40,Overall<=100)

gender_map <- world %>%
  left_join(Age_Rank, by = "iso_a3")
Young_Ranking <- Age_Rank %>% 
group_by(iso_a3) %>% 
count()
Young_Ranking %>% 
arrange(desc(n))
## # A tibble: 10 x 2
## # Groups:   iso_a3 [10]
##    iso_a3     n
##    <fct>  <int>
##  1 USA       66
##  2 KEN        8
##  3 CAN        4
##  4 ETH        4
##  5 JPN        3
##  6 BDI        1
##  7 BRN        1
##  8 GBR        1
##  9 MEX        1
## 10 ZIM        1
marathon <- world %>% 
left_join(Young_Ranking, by = "iso_a3") 
ggplot(data = marathon) +
  geom_sf(aes(fill = n), color = "grey") +
  scale_fill_gradient(low = "#bfafd4",
                      high = "#857a94",
                      na.value = "#f8f7fa") +
  xlab("Longitude") + ylab("Latitude") +
  labs(
    title = "Young Faster Runners Population in Every Country",
    barwidth = 15,
    barheight = 0.5,
    fill = NULL
  ) + theme_void()+theme(plot.title = element_text(size = 16, face = "bold",color = "tomato",hjust=0.5))

ggplotly()
Gender_Rank <- Runner_map %>%
  filter(M.F== "M")

gender_map <- world %>%
  left_join(Gender_Rank, by = "iso_a3")
Gender_Ranking <- Gender_Rank %>% 
group_by(iso_a3) %>% 
count()
Gender_Ranking %>% 
arrange(desc(n))
## # A tibble: 84 x 2
## # Groups:   iso_a3 [84]
##    iso_a3     n
##    <fct>  <int>
##  1 USA    10788
##  2 CAN     1055
##  3 GBR      290
##  4 MEX      190
##  5 CHN      165
##  6 GER      163
##  7 BRA      146
##  8 JPN      129
##  9 ITA      128
## 10 AUS      120
## # … with 74 more rows
marathon <- world %>% 
left_join(Gender_Ranking, by = "iso_a3") 
ggplot(data = marathon) +
  geom_sf(aes(fill = n), color = "grey") +
  scale_fill_gradient(low = "#bfafd4",
                      high = "#857a94",
                      na.value = "#f8f7fa") + 
  xlab("Longitude") + ylab("Latitude") +
  labs(title = "Male Runners Population in every country",barwidth = 15,barheight = 0.5,fill = NULL)+
 theme_void()+theme(plot.title = element_text(size = 16, face = "bold",color = "tomato",hjust=0.5))

ggplotly()
Gender_Rank <- Runner_map %>%
  filter(M.F== "F",Age>=50)

gender_map <- world %>%
  left_join(Gender_Rank, by = "iso_a3")
Gender_Ranking <- Gender_Rank %>% 
group_by(iso_a3) %>% 
count()
Gender_Ranking %>% 
arrange(desc(n))
## # A tibble: 45 x 2
## # Groups:   iso_a3 [45]
##    iso_a3     n
##    <fct>  <int>
##  1 USA     1887
##  2 CAN      264
##  3 GBR       31
##  4 GER       28
##  5 JPN       25
##  6 ITA       19
##  7 CHN       16
##  8 AUS       14
##  9 MEX       14
## 10 BRA        9
## # … with 35 more rows
marathon <- world %>% 
left_join(Gender_Ranking, by = "iso_a3") 
ggplot(data = marathon) +
  geom_sf(aes(fill = n), color = "grey") +
  scale_fill_gradient(low = "#bfafd4",
                      high = "#857a94",
                      na.value = "#f8f7fa") + 
  xlab("Longitude") + ylab("Latitude") +
  labs(title = "Which Countries have Older Female Runners Population?",barwidth = 15,barheight = 0.5,fill = NULL)+
 theme_void()+theme(plot.title = element_text(size = 15, face = "bold",color = "tomato",hjust=0.5))

ggplotly()

conclusion

I chose this dataset because when I was studying my bachelor I joind marathon group of my previous university and I had a memorable time those days.So his dataset could decode some new aspects of this sport for me in other countries. I started ploting some basic plot at first to have a general opinion about the whole topic,then shifted to some interactive plot and finally some spatial plot.bellow you can find some interesting issues which I found from these plot:

Principles of data visualization:

Geometrics:

Different type of plots were created including:histogram,scatterplot,coloumn bar,density,box plot to visualize different aspects of this dataset.

Design:

  • Gestalt Principles tried to be considered specially items such as proximity,similarity,continuty and closure.

  • The Big Four: CRAP

  1. Contrast*: make different by different fonts and color.

  2. Repetition*:Repeat visual elements (colors, fonts, shapes) to create a strong unified theme.

  3. Alignment*: try to have a visual connection among items in the plots

  4. Proximity*: group related items

Aesthetics:

  • Using some functions such as theme() function with installing ggthemes package to create different type of theme for every plot.

  • Customizing the look with coord_cartesian and defining x and y limit for our plot.Modify tilte location with hjust,flipping the axis with coord_flip() and geom_label_repel to add label to plot.Finally I used annotation to add specific point of the chart when was useful.

Color Palettes:

  • Using some different color pallet to emphasize on important sections and results.

  • I tried to use bright or dark colors to highlight information that requires greater attention.in this dataset there was a unordered categorical (gender) required distinct colors that will not be easily confused with one another.

  • Moreover, whenever I needed a special color a Hexadecimal color was used.

  • Drawing spetial plot was challenging for me at first and I got many guidance from Dr.Sanchez,but after being able to a variety of plots their interpretation was interesting for me, also as abbraviation of countries names were avialable,package of rnaturalearth was used and by using function of leftjoin dataframe of marathon_2017 and datashape of world were joined to eachother.